17 research outputs found

    Thread-spawning schemes for speculative multithreading

    Get PDF
    Speculative multithreading has been recently proposed to boost performance by means of exploiting thread-level parallelism in applications difficult to parallelize. The performance of these processors heavily depends on the partitioning policy used to split the program into threads. Previous work uses heuristics to spawn speculative threads based on easily-detectable program constructs such as loops or subroutines. In this work we propose a profile-based mechanism to divide programs into threads by searching for those parts of the code that have certain features that could benefit from potential thread-level parallelism. Our profile-based spawning scheme is evaluated on a Clustered Speculative Multithreaded Processor and results show large performance benefits. When the proposed spawning scheme is compared with traditional heuristics, we outperform them by almost 20%. When a realistic value predictor and a 8-cycle thread initialization penalty is considered, the performance difference between them is maintained. The speed-up over a single thread execution is higher than 5x for a 16-thread-unit processor and close to 2x for a 4-thread-unit processor.Peer ReviewedPostprint (published version

    Visibility rendering order: Improving energy efficiency on mobile GPUs through frame coherence

    Get PDF
    During real-time graphics rendering, objects are processed by the GPU in the order they are submitted by the CPU, and occluded surfaces are often processed even though they will end up not being part of the final image, thus wasting precious time and energy. To help discard occluded surfaces, most current GPUs include an Early-Depth test before the fragment processing stage. However, to be effective it requires that opaque objects are processed in a front-to-back order. Depth sorting and other occlusion culling techniques at the object level incur overheads that are only offset for applications having substantial depth and/or fragment shading complexity, which is often not the case in mobile workloads. We propose a novel architectural technique for mobile GPUs, Visibility Rendering Order (VRO), which reorders objects front-to-back entirely in hardware by exploiting the fact that the objects in graphics animated applications tend to keep its relative depth order across consecutive frames (temporal coherence). Since order relationships are already tested by the Depth Test, VRO incurs minimal energy overheads because it just requires adding a small hardware to capture that information and use it later to guide the rendering of the following frame. Moreover, unlike other approaches, this unit works in parallel with the graphics pipeline without any performance overhead. We illustrate the benefits of VRO using various unmodified commercial 3D applications for which VRO achieves 27% speed-up and 14.8% energy reduction on average over a state-of-the-art mobile GPU.Peer ReviewedPostprint (author's final draft

    Speculative multithreaded processors

    Get PDF
    En esta tesis se estudia el modelo de ejecución de los procesadores multithreaded especulativos así como los requisitos necesarios para su implementación. El modelo de ejecución se basa en la inserción de instrucciones de spawn dentro del código secuencial. De esta manera, la ejecución de un programa en estos procesadores es similar a cualquier otro hasta que se encuentra con un punto de spawn. Entonces, se crea un nuevo thread especulativo en el punto indicado por la instrucción de spawn y ambos threads se ejecutan en paralelo. Cuanto el thread creador llega al punto inicial del thread especulativo, se ha de verificar si la especulación ha sido correcta. En ese caso, el contexto del thread no especulativo se gradúa y se libera para uso futuro de más threads especulativos. En caso de que la verificación no haya sido correcta, se recupera el estado correcto. En este modelo de ejecución siempre hay un thread no especulativo y puede haber múltiples threads especulativos.Para soportar este modelo de ejecución, se necesita: i) hardware capaz de crear y gestionar threads especulativo y ii) un mecanismo de particionado para dividir los programas en threads especulativos. Se han estudiado varias plataformas para gestionar threads de forma concurrente. Por un lado, los procesadores clustered se benefician de menores retardos, menor potencia consumida y una menor complejidad aunque las latencias de comunicación sean mayores. Por otro lado, las arquitecturas centralizadas se benefician del hecho de compartir recursos y menor latencia de comunicación, pero la complejidad del hardware es mucho mayor. En cualquier caso, el hardware ha de ser capaz de ejecutar múltiples threads simultáneamente con el inconveniente de que algunos valores van a tener que compartirse mientras que otros son copias privadas. Es decir, el procesador deberá ser capaz de gestionar múltiples versiones de un mismo registro o posición de memoria para cada uno de los threads que se estén ejecutando.Además, se ha puesto especial énfasis en la gestión de las dependencias de datos entre los threads especulativos ya que tienen un impacto muy importante en el rendimiento del procesador. Encontrar threads independientes es casi imposible en aplicaciones irregulares, por tanto los threads especulativos necesitarán de valores producidos por otros threads especulativos. Se han estudiado dos mecanismos: sincronizar el thread productor y el thread consumidor y predecir los valores dependientes. En el primer caso, se han propuesto mecanismos para pasar el valor tan pronto como ha sido producido del productor al consumidor, especialmente en el caso de valores de memoria. Por otro lado, el segundo modelo es mucho más atrayente ya que si todos los valores dependientes fueran predichos de forma correcta, los threads pasarían a ejecutarse de forma independiente. Se han evaluado múltiples predictores de valores propuestos en la literatura y se ha presentado un nuevo predictor especialmente pensado para este tipo de arquitecturas que es el predictor de incremento. Este predictor usa la información de control de los threads especulativos para predecir los valores y los resultados obtenidos son muy prometedores aún con tamaños muy reducidos del predictor. Finalmente, el particionado de las aplicaciones afecta al rendimiento de este tipo de procesadores. Se han propuesto y evaluado varios esquemas de particionado. Una familia de estos esquemas asigna threads especulativos a construcciones de programa que por si solas proporcionan cierta independencia de control. Políticas de esta familia son aquellas que crean threads especulativos en iteraciones de bucles, continuaciones de bucles y continuaciones de subrutinas. La segunda familia de esquemas de particionadose ayuda de un análisis basado en profiling para encontrar las parejas de spawn más idóneas para cada uno de los códigos. De esta manera, aquellas partes del programa que cumplan las mejores características se seleccionan para crear threads especulativos. Algunos criterios de selección que han sido considerados en esta tesis han sido: la independencia de control, el tamaño mínimo de los threads, la independencia de datos y su predictabilidad. Los resultados obtenidos por ambas familias han sido muy significativos, aunque el esquema basado en técnicas de profile mejora los resultados obtenidos por la otra familia

    Value prediction for speculative multithreaded architectures

    Get PDF
    The speculative multithreading paradigm (speculative thread-level parallelism) is based on the concurrent execution of control-speculative threads. The efficiency of microarchitectures that adopt this paradigm strongly depends on the performance of the control and data speculation techniques. While control speculation is used to predict the most effective points where a thread can be spawned, data speculation is required to eliminate the serialization imposed by inter-thread dependences. This work studies the performance of different value predictors for speculative multithreaded processors. We propose a value predictor, the increment predictor, and evaluate its performance for a particular microarchitecture that implements this execution paradigm (Clustered Speculative Multithreaded architecture). The proposed trace-oriented increment predictor clearly outperforms trace-adapted versions of the last value, stride and context-based predictors, specially for small-sized history tables. A 1-KB increment predictor achieves a 73% prediction accuracy and a performance that is just 13% lower than that of a perfect value predictor.Peer ReviewedPostprint (published version

    A quantitative assessment of thread-level speculation techniques

    No full text
    Speculative thread-level parallelism has been recently proposed as an alternative source of parallelism that can boost the performance for applications where independent threads are hard to find. Several schemes to exploit thread level parallelism have been proposed and significant performance gains have been reported. However, the sources of the performance gains are poorly understood as well as the impact of some design choices. In this work, the advantages of different thread speculation techniques are analyzed as are the impact of some critical issues including the value predictor, the branch predictor, the thread initialization overhead and the connectivity among thread unitsPeer ReviewedPostprint (published version

    A quantitative assessment of thread-level speculation techniques

    No full text
    Speculative thread-level parallelism has been recently proposed as an alternative source of parallelism that can boost the performance for applications where independent threads are hard to find. Several schemes to exploit thread level parallelism have been proposed and significant performance gains have been reported. However, the sources of the performance gains are poorly understood as well as the impact of some design choices. In this work, the advantages of different thread speculation techniques are analyzed as are the impact of some critical issues including the value predictor, the branch predictor, the thread initialization overhead and the connectivity among thread unitsPeer Reviewe

    Thread-spawning schemes for speculative multithreading

    No full text
    Speculative multithreading has been recently proposed to boost performance by means of exploiting thread-level parallelism in applications difficult to parallelize. The performance of these processors heavily depends on the partitioning policy used to split the program into threads. Previous work uses heuristics to spawn speculative threads based on easily-detectable program constructs such as loops or subroutines. In this work we propose a profile-based mechanism to divide programs into threads by searching for those parts of the code that have certain features that could benefit from potential thread-level parallelism. Our profile-based spawning scheme is evaluated on a Clustered Speculative Multithreaded Processor and results show large performance benefits. When the proposed spawning scheme is compared with traditional heuristics, we outperform them by almost 20%. When a realistic value predictor and a 8-cycle thread initialization penalty is considered, the performance difference between them is maintained. The speed-up over a single thread execution is higher than 5x for a 16-thread-unit processor and close to 2x for a 4-thread-unit processor.Peer Reviewe

    Value prediction for speculative multithreaded architectures

    No full text
    The speculative multithreading paradigm (speculative thread-level parallelism) is based on the concurrent execution of control-speculative threads. The efficiency of microarchitectures that adopt this paradigm strongly depends on the performance of the control and data speculation techniques. While control speculation is used to predict the most effective points where a thread can be spawned, data speculation is required to eliminate the serialization imposed by inter-thread dependences. This work studies the performance of different value predictors for speculative multithreaded processors. We propose a value predictor, the increment predictor, and evaluate its performance for a particular microarchitecture that implements this execution paradigm (Clustered Speculative Multithreaded architecture). The proposed trace-oriented increment predictor clearly outperforms trace-adapted versions of the last value, stride and context-based predictors, specially for small-sized history tables. A 1-KB increment predictor achieves a 73% prediction accuracy and a performance that is just 13% lower than that of a perfect value predictor.Peer Reviewe

    Nonignorable nonresponse models for categorical survey data

    Get PDF
    SIGLEAvailable from British Library Document Supply Centre-DSC:DXN020367 / BLDSC - British Library Document Supply CentreGBUnited Kingdo

    Cosmological perturbations in an inflationary universe

    No full text
    SIGLEAvailable from British Library Document Supply Centre-DSC:DXN039770 / BLDSC - British Library Document Supply CentreGBUnited Kingdo
    corecore